A Simple Closed-Class/Open-Class Factorization for Improved Language Modeling
Abstract
We describe a simple improvement to n-gram language models in which the distribution over closed-class (function) words is estimated separately from the conditional distribution of open-class words given function words. In English, function words account for about 30% of written language and form a natural skeleton for most sentences. By factoring a language model into a function word model and a conditional model over open-class words given function words, we largely avoid the problem of sparse training data in the function word model and confine the need for sophisticated smoothing techniques primarily to the conditional model. We test our factored approach on the Brown and Wall Street Journal corpora and observe a 3.5% to 25.2% improvement in perplexity over standard methods, depending on the smoothing method and test set used. Compared to other proposals for improving n-gram language models, our factorization has the advantage of inherent simplicity and efficiency, and it improves generalization between data sets.
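The factorization described in the abstract can be illustrated with a small sketch. The abstract does not give the exact model, so everything here is an assumption made for illustration: the toy `FUNCTION_WORDS` list, the `<OPEN>` placeholder, the bigram order, add-alpha smoothing, and the choice to condition each open-class word only on the preceding skeleton token. The names `skeleton` and `FactoredBigram` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict

# Toy closed-class list (assumption); a real system would use a full lexicon.
FUNCTION_WORDS = frozenset({"the", "a", "of", "in", "on", "and", "to", "is"})
OPEN = "<OPEN>"  # placeholder standing in for any open-class word
BOS, EOS = "<s>", "</s>"


def skeleton(tokens):
    """Keep function words and replace every other token with OPEN."""
    return [t if t in FUNCTION_WORDS else OPEN for t in tokens]


class FactoredBigram:
    """P(sentence) = P(skeleton) * P(open-class words | skeleton context)."""

    def __init__(self, sentences, alpha=1.0):
        self.alpha = alpha  # add-alpha smoothing constant (assumption)
        self.skel_vocab = set(FUNCTION_WORDS) | {OPEN, EOS}
        self.skel_bigrams = Counter()
        self.skel_context = Counter()
        self.open_counts = defaultdict(Counter)  # previous skeleton token -> word counts
        self.open_vocab = set()
        for sent in sentences:
            skel = [BOS] + skeleton(sent) + [EOS]
            for prev, cur in zip(skel, skel[1:]):
                self.skel_bigrams[prev, cur] += 1
                self.skel_context[prev] += 1
            # skel[i] is the skeleton token that precedes sent[i]
            for prev, word in zip(skel, sent):
                if word not in FUNCTION_WORDS:
                    self.open_counts[prev][word] += 1
                    self.open_vocab.add(word)

    def p_skel(self, prev, cur):
        """Smoothed bigram probability over the small skeleton vocabulary."""
        num = self.skel_bigrams[prev, cur] + self.alpha
        den = self.skel_context[prev] + self.alpha * len(self.skel_vocab)
        return num / den

    def p_open(self, prev, word):
        """Smoothed probability of an open-class word given the previous
        skeleton token; one extra slot is reserved for unseen words."""
        counts = self.open_counts[prev]
        num = counts[word] + self.alpha
        den = sum(counts.values()) + self.alpha * (len(self.open_vocab) + 1)
        return num / den

    def log_prob(self, sent):
        """Natural-log probability of a tokenized sentence."""
        skel = [BOS] + skeleton(sent) + [EOS]
        lp = sum(math.log(self.p_skel(p, c)) for p, c in zip(skel, skel[1:]))
        lp += sum(math.log(self.p_open(p, w))
                  for p, w in zip(skel, sent) if w not in FUNCTION_WORDS)
        return lp


train = ["the cat sat on the mat".split(), "a dog is in the house".split()]
model = FactoredBigram(train)
print(model.log_prob("the dog sat on the mat".split()))
```

In this sketch the skeleton model has a vocabulary of only a handful of tokens, so its counts are dense even on tiny data, while all sparsity is pushed into `p_open`. That mirrors the abstract's claim that smoothing effort can be concentrated on the conditional open-class model, though the paper's actual estimators may differ.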
Publication date: 2001